[restatectl] Improve performance and information display with unprovisioned clusters or under partial node connectivity #2748

Merged: pcholakov merged 5 commits into main from feat/restatectl-node-connections on Feb 17, 2025


@pcholakov pcholakov commented Feb 17, 2025

This PR aims to improve the overall responsiveness of restatectl with clusters that are still busy provisioning or otherwise only partially connected to the host running the tool.

  • Iterating over the nodes to fetch the nodes configuration or metadata now remembers which nodes were unresponsive
  • The metadata status check polls the nodes known to run the metadata-server role concurrently, and skips nodes that earlier connection attempts flagged as unreachable
  • The overall CLI connect timeout is reduced to 3s (from 5s)
  • restatectl status now prints a heading for the Metadata section to visually separate it from the other sections
  • Debug logging has been added in several places
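
As an illustration of the first two points, here is a minimal sketch of the idea using only the standard library. It is not the code from this PR: `NodeId`, `UnreachableNodes`, `poll_concurrently` and the `probe` callback are hypothetical names standing in for the real connection and RPC types, and plain threads are used in place of whatever async runtime the tool relies on.

```rust
use std::collections::HashSet;
use std::sync::Mutex;
use std::thread;

/// Hypothetical node identifier; the real tool has its own node ID types.
type NodeId = u32;

/// Remembers nodes that failed to respond so that later calls can skip them.
#[derive(Default)]
struct UnreachableNodes {
    dead: Mutex<HashSet<NodeId>>,
}

impl UnreachableNodes {
    fn is_dead(&self, node: NodeId) -> bool {
        self.dead.lock().unwrap().contains(&node)
    }

    fn mark_dead(&self, node: NodeId) {
        self.dead.lock().unwrap().insert(node);
    }
}

/// Polls every node that is not already flagged, one thread per node.
/// `probe` stands in for the real status request and returns Err on failure.
fn poll_concurrently<T, F>(
    nodes: &[NodeId],
    flagged: &UnreachableNodes,
    probe: F,
) -> Vec<(NodeId, Result<T, String>)>
where
    T: Send,
    F: Fn(NodeId) -> Result<T, String> + Sync,
{
    thread::scope(|scope| {
        // Spawn all probes first so that they run concurrently.
        let handles: Vec<_> = nodes
            .iter()
            .copied()
            .filter(|node| !flagged.is_dead(*node))
            .map(|node| {
                let probe = &probe;
                scope.spawn(move || {
                    let result = probe(node);
                    if result.is_err() {
                        // Remember the failure for subsequent calls.
                        flagged.mark_dead(node);
                    }
                    (node, result)
                })
            })
            .collect();

        // Then gather the results in the original node order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("probe thread panicked"))
            .collect()
    })
}
```

With this shape, a second call that shares the same `UnreachableNodes` value never waits on a node that already timed out once, which is where the responsiveness gain under partial connectivity comes from.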

Further work:

  • update get_latest_metadata to gather IdentResponses concurrently and cache them (so they can be reused between get nodes and logs)
  • be smarter about which nodes to contact, based on the returned status

Testing

Single alive node of a three-node cluster (not yet provisioned):

❯ rc st -s node1.cluster.orb.local:5122
Node Configuration (v3)
 NODE  GEN  NAME   ADDRESS                               ROLES                                         
 N1    2    node1  http://node1.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 

Logs v1
└ Logs Provider: replicated
 ├ Log replication: {node: 2}
 └ Nodeset size: 0
No logs found. Has the cluster been provisioned yet?

Alive partition processors (nodes config v3, partition table v1)
 P-ID  NODE  MODE  STATUS  LEADER  EPOCH  SEQUENCER  APPLIED-LSN  PERSISTED-LSN  SKIPPED-RECORDS  ARCHIVED-LSN  LAST-UPDATE 

Metadata service nodes
 NODE  STATUS  VERSION  LEADER  MEMBERS  APPLIED  COMMITTED  TERM  LOG-LENGTH  SNAP-INDEX  SNAP-SIZE 
 N1    Member  v1       N1      [N1]     2        2          2     1           1           472 B    

Single dead node:

❯ time rc st -s node2.cluster.orb.local:5122
Error: Encountered multiple errors:
 - http://node2.cluster.orb.local:5122/ -> status: Unavailable, message: "tcp connect error: deadline has elapsed", details: [], metadata: MetadataMap { headers: {} }

restatectl st -s node2.cluster.orb.local:5122  0.03s user 0.01s system 0% cpu 5.348 total

Provisioned cluster with two alive nodes and one dead node:

❯ time rc st -s node1.cluster.orb.local:5122
Node Configuration (v9)
 NODE  GEN  NAME   ADDRESS                               ROLES                                         
 N1    2    node1  http://node1.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 
 N2    1    node2  http://node2.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 
 N3    1    node3  http://node3.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 

Logs v3
└ Logs Provider: replicated
 ├ Log replication: {node: 2}
 └ Nodeset size: 0
 L-ID  FROM-LSN  KIND        LOGLET-ID  REPLICATION  SEQUENCER  NODESET      
 0     2         Replicated  0_1        {node: 2}    N2:1       [N1, N2, N3] 
 1     2         Replicated  1_1        {node: 2}    N2:1       [N1, N2, N3] 
 2     2         Replicated  2_1        {node: 2}    N1:2       [N1, N2, N3] 
 3     2         Replicated  3_1        {node: 2}    N1:2       [N1, N2, N3] 
...

Alive partition processors (nodes config v9, partition table v4)
 P-ID  NODE  MODE      STATUS  LEADER  EPOCH  SEQUENCER  APPLIED-LSN  PERSISTED-LSN  SKIPPED-RECORDS  ARCHIVED-LSN  LAST-UPDATE             
 0     N1:2  Follower  Active  N2:1    e1                1            -              0                -             582 ms ago              
 0     N2:1  Leader    Active  N2:1    e1                1            -              0                -             802 ms ago              
 1     N1:2  Follower  Active  N2:1    e1                1            -              0                -             1 second and 92 ms ago  
 1     N2:1  Leader    Active  N2:1    e1                1            -              0                -             802 ms ago              
 2     N1:2  Leader    Active  N1:2    e1                1            -              0                -             517 ms ago              
 2     N2:1  Follower  Active  N1:2    e1                1            -              0                -             608 ms ago              
 3     N1:2  Leader    Active  N1:2    e1                1            -              0                -             971 ms ago              
 3     N2:1  Follower  Active  N1:2    e1                1            -              0                -             923 ms ago              
...

☠️ Dead nodes
 NODE  LAST-SEEN                           
 N3    1 minute, 33 seconds and 503 ms ago 

Metadata service nodes
 NODE  STATUS  VERSION  LEADER  MEMBERS     APPLIED  COMMITTED  TERM  LOG-LENGTH  SNAP-INDEX  SNAP-SIZE 
 N2    Member  v3       N1      [N1,N2,N3]  41       41         2     4           37          8.8 kiB   
 N1    Member  v3       N1      [N1,N2,N3]  41       41         2     4           37          8.8 kiB   

🔌 Unreachable nodes
 NODE  REASON                                                                                              
 N3    status: Unknown, message: "Node is unreachable", details: [], metadata: MetadataMap { headers: {} } 
restatectl st -s node1.cluster.orb.local:5122  0.06s user 0.02s system 1% cpu 5.405 total



github-actions bot commented Feb 17, 2025

Test Results

  7 files  ±0    7 suites  ±0   4m 15s ⏱️ -1s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit f3208bf. ± Comparison against base commit a470649.

♻️ This comment has been updated with latest results.

@pcholakov pcholakov changed the title from "Feat/restatectl node connections" to "[restatectl] Improve performance and information display with unprovisioned clusters or under partial node connectivity" Feb 17, 2025
@pcholakov pcholakov force-pushed the feat/restatectl-node-connections branch from 484c27f to 0fc6cb9 Compare February 17, 2025 11:56
@pcholakov pcholakov force-pushed the feat/restatectl-node-connections branch from 97368a9 to f3208bf Compare February 17, 2025 12:38
@pcholakov pcholakov marked this pull request as ready for review February 17, 2025 12:38

@muhamadazmy muhamadazmy left a comment


Thank you so much @pcholakov for these really nice improvements. I also like that the metadata status is now fetched in parallel. It made me think that there are probably more things that can be done in parallel, including the fetching in ConnectionInfo::get_latest_metadata. What do you think?

@pcholakov

Yeah, absolutely! I'd definitely love to continue this by fetching metadata in parallel. I can imagine a scatter-gather version of ConnectionInfo::try_each, for example, which reaches out to the desired number of nodes concurrently and adds more tasks as needed when some responses from the initial batch fail. Lots of room for improvement :-) I also started down the path of creating a standalone ConnectionCache struct, but it turned out that I needed to lock the connection cache and the dead node set separately. Definitely some room for further evolution there, too!
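
To make the scatter-gather idea concrete, here is a rough sketch of what such a helper could look like. It is only an illustration under stated assumptions: `scatter_gather`, `NodeId` and the `probe` callback are hypothetical and not part of the existing ConnectionInfo API, and plain threads with a channel stand in for async tasks.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical node identifier used only for this sketch.
type NodeId = u32;

/// Collects up to `wanted` successful responses. Starts `wanted` probes at
/// once and launches one replacement for every failure, until enough nodes
/// have answered or no untried candidates remain.
fn scatter_gather<T, F>(candidates: &[NodeId], wanted: usize, probe: F) -> Vec<(NodeId, T)>
where
    T: Send,
    F: Fn(NodeId) -> Result<T, String> + Sync,
{
    let mut successes = Vec::new();
    thread::scope(|scope| {
        let (tx, rx) = mpsc::channel();
        let mut remaining = candidates.iter().copied();
        let mut in_flight = 0usize;
        let probe = &probe;

        // Spawns a probe for the next untried candidate, if one is left.
        let mut launch = |in_flight: &mut usize| {
            if let Some(node) = remaining.next() {
                let tx = tx.clone();
                scope.spawn(move || {
                    // The receiver outlives every worker, so a send error is not expected.
                    let _ = tx.send((node, probe(node)));
                });
                *in_flight += 1;
            }
        };

        // Scatter: the initial batch.
        for _ in 0..wanted {
            launch(&mut in_flight);
        }

        // Gather: replace each failed probe with the next candidate.
        while in_flight > 0 {
            let (node, result) = rx.recv().expect("a worker is still in flight");
            in_flight -= 1;
            match result {
                Ok(value) => successes.push((node, value)),
                Err(_) => launch(&mut in_flight),
            }
        }
    });
    successes
}
```

Successes are returned in completion order rather than candidate order. A real version would presumably also consult and update the dead node set, which is where the separate locking mentioned above comes into play.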

@pcholakov pcholakov merged commit fbc82b5 into main Feb 17, 2025
29 checks passed
@pcholakov pcholakov deleted the feat/restatectl-node-connections branch February 17, 2025 15:28